Distracted driving is considered one of the causes of traffic
accidents. The U.S. National Highway Traffic Safety Administration
(NHTSA) defines distracted driving as any activity that takes attention away
from driving, such as doing makeup, texting, calling, and reaching behind.
Most deaths, physical injuries, and economic losses could have been
prevented if the distracted driver had been alerted in time. This paper proposes
a new convolutional neural network (CNN) called DistractNet to detect
drivers' distractions. The proposed model was trained and tested on the State
Farm distracted driver detection image dataset available at Kaggle, which
contains images of drivers performing the most common distracting
activities, divided into ten classes. We studied the performance of the
proposed CNN model in terms of accuracy, training time, and model size,
and compared it with four pre-trained networks (ResNet-50, GoogLeNet,
InceptionV3, and AlexNet) using transfer learning. The
experimental results show that the developed CNN model achieves
an average accuracy of 99.32% with 93 min of training time and
a model size of 7.99 MB. The resulting model classifies driver states into ten
classes and reports the predicted label and the probability (%) for each class.
1. INTRODUCTION


In the United States, about eight people are killed every day in crashes involving a distracted driver
[1]. According to the U.S. National Highway Traffic Safety Administration (NHTSA), distracted
driving was responsible for 2,841 deaths and 400,000 injuries in motor vehicle crashes in 2018 [2]. Distraction can be
divided into three main types: visual, manual, and cognitive [3]. Visual distraction covers
all tasks that require the driver to look away from the roadway, such as texting and reaching behind.
Manual distraction covers all tasks that require the driver to take a hand off the steering wheel, such as
doing makeup, operating the radio, and drinking. In cognitive distraction, the driver is thinking about something
other than the driving task, for example when talking on the phone or to a passenger. In recent years, distracted
driver detection has received attention from the research community, and many solutions and approaches have
been proposed to prevent road accidents. Driver assistance and driving monitoring systems also provide an
efficient way to reduce road accidents caused by distracted driving [4].
Previously used methods fall into two classes. The first is based on image processing, using deep learning
techniques for feature extraction to interpret the state of the driver. The second is based on
driving signals, such as acceleration, speed, gravity, and revolutions per minute. These driving signals can be
accessed through the standard on-board diagnostics (OBD-II) port or using external physical sensors such as
accelerometers and gyroscopes. In this paper, we design a new convolutional neural network (CNN)
[5], [6] model for distracted driver detection, study and evaluate the performance of the developed model
in terms of accuracy, and compare it with pretrained models using transfer
learning. The remainder of this paper is organized as follows: the rest of this section introduces
related work on distracted driver detection. In section 2, we describe the
materials and methods used to develop the proposed CNN model. Section 3 provides the experimental results
and a comparison with related works. Section 4 summarizes the main findings of the work and gives
suggestions for future research.

In related work [7], [8], the authors proposed a portable system for monitoring and controlling
driver behavior. The system was designed to acquire data from the vehicle using OBD-I. Similarly,
Ahmed et al. [9] used the accelerometer and gyroscope sensors built into smartphones to
detect distracted driving. To collect sensor data, the authors ran experiments with 16 subjects
instructed to drive while performing the most common distracting activities. The collected data were used
as input to a machine learning classifier. The resulting system reaches an accuracy of 98.76% in detecting
undistracted driving. Shahverdy et al. [10] aimed to classify driving styles into
five classes (distracted, normal, aggressive, drowsy, and drunk driving) using driving signals such as
acceleration, gravity, and throttle. Craye and Karray [11] proposed a module to
detect distracted driving and recognize the type of distraction based on extracted features, including facial
expressions, eye behavior, head movement, and arm position, using computer vision techniques. The
proposed module achieves an accuracy of 90% for distraction detection and 85% for recognition of
the type of distraction. Kutila et al. [12] introduced a camera vision system to detect visual
and cognitive distraction based on support vector machine (SVM) classification [13], [14].
This method achieved an accuracy of 80% for visual distraction and 86% for cognitive distraction.
Mbouna et al. [15] analyzed eye state and head pose to monitor driver alertness.
To extract the most critical information, the authors used visual features such as pupil activity, eye index,
and head pose. These features are passed to an SVM to classify the driver’s state. The resulting
system reaches an accuracy of 91%.


2. MATERIALS AND METHOD 

This section describes the materials and method: the dataset and the method used to develop the
CNN model architecture that detects and classifies distracted drivers into ten classes. For training and
testing, we used a machine with a 3.6 GHz Intel(R) Core i5-8350U
CPU. We used Matlab R2018b for the development and execution of the
proposed algorithm.


2.1. Dataset description 

Several datasets have been proposed for the detection and classification of distracted driving. The American
University in Cairo (AUC) distracted driver’s dataset [16] consists of 17,308 RGB images with a size of
1080x1920, obtained from videos of 11 participants of different ethnicities. The video was acquired
using a phone camera with a resolution of up to 1080x1920 pixels, placed on top of the passenger’s seat to
capture the driver’s state in different vehicles and driving conditions.

The State Farm distracted driver dataset [17], available at Kaggle, contains images of drivers performing
the most common distracting activities, divided into ten classes. It contains 22,424 RGB images with a size of
640x480. Figure 1 shows image samples of each class: safe driving in Figure 1(a), texting (right hand) in
Figure 1(b), talking (right hand) in Figure 1(c), texting (left hand) in Figure 1(d), talking
(left hand) in Figure 1(e), operating the radio in Figure 1(f), drinking in
Figure 1(g), reaching behind in Figure 1(h), hair and makeup in Figure 1(i), and talking to
passenger(s) in Figure 1(j).


2.2. Proposed method 
The proposed method is composed of five steps:
- Data preprocessing: the State Farm dataset [17] contains images with a size of 640x480 pixels. To match
the input layers of the CNNs used, we resized all images to 224x224, 227x227, and 299x299 pixels
using the “imresize” function available in the Image Processing Toolbox in
Matlab.

- Split data: the processed images were divided into two sets to prepare the
proposed CNN model: 70% (15,697 images) of the dataset for training and 30% (6,727 images) for
testing. Table 1 shows the total samples for each class.

- Create and configure the CNN model: in this step, we defined the CNN layers, including
convolutional, pooling, and fully connected layers, each with filters at different resolutions
defined by width, depth, and height. The proposed CNN DistractNet was designed and trained using
MATLAB with the Deep Learning Toolbox.

- Train CNN: the commonly used training options are the maximum number of epochs and the learning rate. The
first defines the number of full passes over the training set, and the second controls the size of each
weight update and therefore the training speed.

- Evaluate CNN performance: we evaluated the performance of the proposed CNNs based on
classification accuracy, training time, execution time, and model size.
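
The split step above can be illustrated with a minimal sketch. It is written in Python rather than the Matlab used in the paper, and it assumes the 70/30 split is applied class by class, which the per-class counts in Table 1 suggest but the text does not state. The function name `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), splitting every class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random assignment within the class
        n_train = round(train_fraction * len(indices))
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return train_idx, test_idx

# Toy labels (made-up data): 10 images of class "A" and 20 of class "B".
labels = ["A"] * 10 + ["B"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting per class keeps the class proportions of the training and test sets close to those of the full dataset.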


Figure 1. Image samples from the State Farm dataset, (a) safe driving, (b) texting (right hand), (c) talking (right
hand), (d) texting (left hand), (e) talking (left hand), (f) operating the radio, (g) drinking, (h) reaching behind,
(i) hair and makeup, and (j) talking to passenger(s)


Table 1. Total samples from the State Farm dataset for each class

Class                                 Samples   Training (70%)   Test (30%)
A-Safe driving                        2489      1742             747
B-Texting (right hand)                2267      1587             680
C-Talking on the phone (right hand)   2317      1622             695
D-Texting (left hand)                 2346      1642             704
E-Talking on the phone (left hand)    2326      1628             698
F-Operating the radio                 2312      1618             694
G-Drinking                            2325      1628             698
H-Reaching behind                     2002      1401             601
I-Hair and makeup                     1911      1338             573
J-Talking to passenger(s)             2129      1490             639
Total samples                         22424     15697            6729


2.3. Proposed CNN DistractNet 

The proposed CNN DistractNet consists of an input layer, an output layer, and
seven hidden layers, as shown in Figure 2. These layers are divided into two categories: feature detection
layers and classification layers. The feature detection layers consist of convolutional and pooling layers.
The first layer takes an input image with dimensions of 227x227x3 and is followed by a convolutional layer
with 16 filters of size 3x3, resulting in a dimension of 227x227x16. We used three convolutional layers to
extract features such as edges, color, and gradient orientation. The third layer is max pooling with a
filter size of 2x2 and a stride of 2, resulting in a dimension of 113x113x16. The pooling layer simplifies
the output by reducing the number of parameters that the network needs to learn. In the same way, the
fourth layer is convolutional with 32 filters of size 3x3, followed by a max pooling layer with a filter size of
2x2 and a stride of 2, resulting in a reduced dimension of 56x56x32. The sixth layer is convolutional with
64 filters of size 3x3 and a stride of 1, giving a dimension of 56x56x64.
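
The layer dimensions can be checked with a short shape trace. This is a sketch in Python, not the authors' Matlab code. It assumes the convolutions use stride 1 with "same" padding (the only way a 3x3 filter leaves 227x227 unchanged) and that pooling uses no padding; the helper names are hypothetical.

```python
def conv_same(shape, filters):
    """Stride-1 convolution with 'same' padding: height and width are kept."""
    height, width, _ = shape
    return (height, width, filters)

def max_pool(shape, size=2, stride=2):
    """Unpadded max pooling: floor((n - size) / stride) + 1 along each axis."""
    height, width, channels = shape
    return ((height - size) // stride + 1,
            (width - size) // stride + 1,
            channels)

shape = (227, 227, 3)
trace = [("input", shape)]
for name, step in [
    ("conv1", lambda s: conv_same(s, 16)),
    ("pool1", max_pool),
    ("conv2", lambda s: conv_same(s, 32)),
    ("pool2", max_pool),
    ("conv3", lambda s: conv_same(s, 64)),
]:
    shape = step(shape)
    trace.append((name, shape))

for name, dims in trace:
    print(name, dims)
```

Under these assumptions the trace ends at 56x56x64, which matches the output size given in Figure 2.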
Finally, after the convolutional and pooling layers, the CNN shifts to classification using
a fully connected (FC) layer. The final layer of the DistractNet architecture is softmax, with 10
possible classes, which provides the classification output and the probability of each class for the input
image. Table 2 shows the DistractNet training accuracy and loss for varying epochs and
iterations.
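
As an illustration of how a softmax output yields a predicted label and per-class probabilities, the following Python sketch applies softmax to ten scores. The scores are made-up numbers, not outputs of DistractNet, and the `predict` helper is hypothetical; the class names follow Table 1.

```python
import math

CLASS_NAMES = [
    "safe driving", "texting (right hand)", "talking on the phone (right hand)",
    "texting (left hand)", "talking on the phone (left hand)",
    "operating the radio", "drinking", "reaching behind",
    "hair and makeup", "talking to passenger(s)",
]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores):
    """Return the most probable class name and all class probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return CLASS_NAMES[best], probs

# Made-up scores from a fully connected layer with 10 outputs.
scores = [0.2, -1.0, 0.5, 0.1, -0.3, 1.1, 4.0, 0.0, -2.0, 0.7]
label, probs = predict(scores)
print(label, f"{100 * max(probs):.2f}%")
```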


2.4. Transfer learning 

Transfer learning [18]-[20] is currently one of the most popular techniques in deep learning when
the target dataset is relatively small or insufficient. The purpose of transfer learning is to reuse a
pre-trained CNN architecture for a similar or related task [21]. In this study, we used four pre-trained
networks: ResNet-50 [22], GoogLeNet [23], InceptionV3 [24], and AlexNet [25]. Table 3 gives
an overview of the pre-trained CNNs used in this work.


[Figure 2: layer diagram. Input 227x227x3; Conv1 (3x3, stride 1, 16 kernels) giving 227x227x16; Max Pooling1
(2x2, stride 2) giving 113x113x16; Conv2 (3x3, stride 1, 32 kernels) giving 113x113x32; Max Pooling2 (2x2,
stride 2) giving 56x56x32; Conv3 (3x3, stride 1, 64 kernels) giving 56x56x64; FC; Softmax (10 classes)]


Figure 2. Proposed DistractNet architecture


Table 2. Training and loss process of DistractNet

Epoch   Iteration   Accuracy   Loss     Learning rate
1       1           8.59%      2.8339   0.001
1       50          71.88%     0.6480   0.001
1       100         93.75%     0.2187   0.001
2       150         99.22%     0.0397   0.001
2       200         96.88%     0.0938   0.001
2       244         96.09%     0.1417   0.001


Table 3. Comparison of different CNN models

Network       Year   Layers   Input size   Parameters
ResNet-50     2015   50       224x224x3    23 M
GoogLeNet     2014   22       224x224x3    7 M
InceptionV3   2015   48       299x299x3    23.6 M
AlexNet       2012   8        227x227x3    61 M


ResNet-50 [22] consists of 50 layers: 48 convolutional layers, one max-pooling layer, and one average-pooling
layer. This CNN can be used for computer vision problems, including image classification and object
detection. The input image size of ResNet-50 is 224x224x3, and the network has 23 million parameters. GoogLeNet [23] is a
CNN that contains 22 layers. This network has two versions: one trained on the ImageNet dataset [26], which
classifies images into 1,000 object classes, and one trained on the Places365 dataset [27], which classifies
images into 365 place categories. Both pretrained versions have an input image size of
224x224x3. InceptionV3 [24], developed in 2015, is 48 layers deep and was trained and tested on more than a
million images from the ImageNet dataset [26]. InceptionV3 primarily aims to consume less computing
power, using fewer than 25 million parameters, by modifying previous Inception architectures. The network has
an input image size of 299x299x3. AlexNet [25] is a CNN that contains eight layers: five convolutional
layers and three fully connected layers. AlexNet can classify images into 1,000 object categories. This CNN
was trained on the ImageNet dataset [26], which contains 1.2 million images divided
into 1,000 classes. For this network, the input image size is 227x227x3. Figure 3 shows the
training progress for all CNNs used in this work with varying epochs and iterations: accuracy in
Figure 3(a) and loss in Figure 3(b).
Figure 3. Training and loss process of CNN models, (a) accuracy rates for training sets, and (b) loss values
for training sets


3. RESULTS AND DISCUSSION 
3.1. Result and analysis 

Once the proposed DistractNet is trained from scratch, we can evaluate its performance and
demonstrate the efficiency of the proposed system. For that, we used the confusion matrix [28],
shown in Figure 4, and calculated the accuracy for each class. The 6729 test samples contain images of
drivers of various ethnicities in different scenarios and conditions. The
overall accuracy is the number of correct predictions divided by the total number of predictions
(correct predictions + incorrect predictions), as shown in (1):


\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]


where true positive (TP) is the number of positive class records correctly classified, true negative (TN) is
the number of negative class records correctly classified, false positive (FP) is the number of negative class
records incorrectly classified, and false negative (FN) is the number of positive class records incorrectly
classified.
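
A minimal Python sketch of these definitions for a multi-class confusion matrix follows. It assumes rows are actual classes and columns are predicted classes, and it treats each class one-vs-rest to obtain TP, TN, FP, and FN for (1). The 3x3 matrix is made-up data for illustration, not a result from the paper, and the function names are hypothetical.

```python
def overall_accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def one_vs_rest_counts(cm, k):
    """Return (TP, TN, FP, FN) for class k; rows are actual, columns predicted."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                # class k predicted as something else
    fp = sum(row[k] for row in cm) - tp  # other classes predicted as k
    tn = total - tp - fn - fp
    return tp, tn, fp, fn

def class_accuracy(cm, k):
    """Equation (1) applied to class k against all other classes."""
    tp, tn, fp, fn = one_vs_rest_counts(cm, k)
    return (tp + tn) / (tp + tn + fp + fn)

# Made-up 3-class confusion matrix.
cm = [
    [50, 2, 0],
    [3, 45, 1],
    [0, 4, 47],
]
print(round(overall_accuracy(cm), 4))
print(one_vs_rest_counts(cm, 0))
```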

According to the results obtained from the confusion matrix for each pretrained network as shown in 
Figure 5, we have observed that “Reaching behind” and “Talking to passenger(s)” classes are confused and 
misclassified with each other. The reason is that the position of the head is the same in those classes. 
Similarly, “Talking (right hand)” is confused with “Hair and makeup”. 


Figure 4. Confusion matrix of proposed CNN DistractNet
Table 4 shows the performance of all CNN models used in this work. The classification
accuracies obtained using ResNet-50 (Figure 5(a)), GoogLeNet (Figure 5(b)),
InceptionV3 (Figure 5(c)), and AlexNet (Figure 5(d)) were 98.16%, 97.96%, 98.02%,
and 98.44%, respectively, for 22,424 images, two epochs of training, and a learning rate of 0.001. The
experimental results also demonstrate that the proposed DistractNet model is more accurate than all
the pre-trained networks while having the smallest size (7.99 MB). The sizes of ResNet-50, GoogLeNet, InceptionV3, and
AlexNet were 91.4 MB, 94.4 MB, 92 MB, and 629 MB, respectively. The model size depends on the CNN architecture,
including the number of layers and parameters. We also compared the execution time (speed), in seconds, that
a single classification takes for each CNN model on a central processing unit (CPU).
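
The paper does not describe how the execution time was measured. One common approach, sketched here in Python under that caveat, is to average the wall-clock time of repeated single-image classifications after a warm-up call. The `dummy_classifier` function is a hypothetical stand-in for a trained network.

```python
import time

def mean_latency(classify, images, repeats=5):
    """Average wall-clock seconds per call of classify(image)."""
    classify(images[0])  # warm-up so one-time setup cost is not counted
    start = time.perf_counter()
    for _ in range(repeats):
        for image in images:
            classify(image)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(images))

def dummy_classifier(image):
    """Hypothetical stand-in for a trained network: returns a class index."""
    return sum(image) % 10

# Made-up inputs; a real test would pass preprocessed images.
images = [[i, i + 1, i + 2] for i in range(100)]
latency = mean_latency(dummy_classifier, images)
print(f"{latency:.6f} s per classification")
```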


Figure 5. Confusion matrices of pretrained networks (actual class versus predicted class, classes A to J),
(a) ResNet-50, (b) GoogLeNet, (c) InceptionV3, and (d) AlexNet


The proposed CNN DistractNet has 2 million parameters in total, which is the number of weights and
biases of the convolutional (Conv) and FC layers. The training time for transfer learning is small compared to training from
scratch because the pre-trained model has already learned its weights on a previous
task. This technique can reduce training time and computing resources. The training time also depends
on many factors, such as the number of images in the dataset, the network architecture, and the performance of the
processing platform (processor, RAM, and graphics hardware). Figure 6 provides a comparison of the CNNs used in this
work. The vertical axis shows the classification accuracy. The horizontal axis shows the training time in
minutes for 2 epochs. Circle size represents the number of parameters for each network.
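
The parameter total can be reproduced approximately with a short Python calculation. This sketch assumes that each convolutional filter has one bias, that the FC layer connects the flattened 56x56x64 output of the last convolution directly to the 10 classes, and that parameters are stored as 4-byte floats; none of these details is stated explicitly in the paper.

```python
def conv_params(kernel, channels_in, channels_out):
    """Filter weights plus one bias per output channel."""
    return kernel * kernel * channels_in * channels_out + channels_out

def fc_params(inputs, outputs):
    """Weight matrix plus one bias per output unit."""
    return inputs * outputs + outputs

layers = {
    "conv1": conv_params(3, 3, 16),
    "conv2": conv_params(3, 16, 32),
    "conv3": conv_params(3, 32, 64),
    "fc": fc_params(56 * 56 * 64, 10),
}
total = sum(layers.values())
size_mb = total * 4 / 1e6  # assumed 4 bytes per parameter

for name, count in layers.items():
    print(name, count)
print("total", total, f"about {size_mb:.2f} MB")
```

Under these assumptions the total is about 2.03 million parameters and roughly 8.1 MB, which is consistent with the reported 2 million parameters and of the same order as the reported 7.99 MB. Almost all parameters sit in the FC layer.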


DistractNet: a deep convolutional neural network architecture for ... (Ismail Nasri) 


500 0 ISSN: 2252-8938 


Table 4. Performance comparison between DistractNet and pretrained networks

CNN model     Accuracy   Learning time (mm:ss)   Model size (MB)   Execution time (speed)
DistractNet   99.32%     92:54                   7.99              0.0299 s
ResNet-50     98.16%     75:15                   91.4              0.0672 s
GoogLeNet     97.96%     39:11                   94.2              0.0463 s
InceptionV3   98.02%     79:02                   92                0.0563 s
AlexNet       98.44%     33:58                   629               0.0878 s


In this part, we evaluated the impact of the number of training images on classification
accuracy. The results show that the best overall accuracy is obtained when the number of images is
22,424. In deep learning, and in CNNs in particular, the number of images significantly affects the classification
accuracy of the models. Therefore, producing more accurate results requires a large number of
images [29]. Figure 7 shows the accuracy of the CNNs for different numbers of training
images.
Figure 6. Accuracy, learning time, and number of parameters of CNN models

Figure 7. Accuracy comparison of CNN models with varying number of images in the dataset


3.2. Comparison with related works 
Previously used methods for distracted driver detection in smart transportation systems [30]-[32]
can be divided into two main classes. The first is based on data collected from sensors such
as accelerometers and gyroscopes, using OBD-II or external sensors. The authors in [9], [10] aimed to detect
distraction based on driving signals. This class of methods takes advantage of the many sensors
available within vehicles and can provide significant results. The accuracy of distraction
detection reached more than 98%.

On the other hand, the second class is based on processing images captured by a camera module
installed inside the vehicle to monitor driver behavior; the images are classified using a machine learning
algorithm such as a CNN, Random Forests, or SVM. The authors in [11], [12] developed machine
learning classifiers based on features extracted from the input image to predict driver behavior. Table 5
compares the performance of the different methods.


Table 5. Accuracy comparison between proposed model and related works

Reference              Class    Method                                                                     Accuracy
[9]                    Sensor   Accelerometer, gyroscope using Random Forests based algorithm              98.76%
[10]                   Sensor   Acceleration, throttle, gravity, speed, and revolutions per minute (RPM)   -
[11]                   Vision   Facial expressions, eye behavior, head movement, and arm position
                                using computer vision techniques                                           90%
[12]                   Vision   Support vector machine (SVM)                                               80%
[15]                   Vision   Eye state and head pose using SVM                                          91%
Proposed DistractNet   Vision   Convolutional neural network (CNN)                                         99.32%


4. CONCLUSION 

In this work, we proposed a new CNN named DistractNet for distracted driver detection and
recognition of nine types of the most common distracting activities, such as texting, talking, operating the
radio, and reaching behind. The State Farm distracted driver detection dataset was used for training and
testing the proposed CNN model. Compared to the related works, the proposed DistractNet model was the
most accurate, with an average accuracy of 99.32%. The experimental results also demonstrate that
DistractNet performs well in terms of training time and model size.

In future work, we will focus on implementing the proposed CNN model in an embedded system
that can be used in real time to monitor driver states. As an extension of this work, we will also connect
the proposed system to other electronic control units (ECUs) in the vehicle network so that further
action can be taken to avoid accidents, such as warning distracted drivers with a sound alarm and a text
message.